PHAST: Spoken Document Retrieval Based on Sequence Alignment
Abstract
This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem combines an automatic speech recognizer (ASR) with standard information retrieval techniques based on terms or n-grams. However, state-of-the-art large-vocabulary continuous ASRs produce transcripts of spontaneous speech with a word error rate of 25% or higher, which is a drawback for retrieval techniques based on terms or n-grams. To overcome this limitation, our method uses a sequence alignment algorithm drawn from the field of bioinformatics to search for “sounds like” sequences in the document collection. These matching sequences are potentially words misrecognized by the ASR and can be used to retrieve relevant passages and documents from the collection. Our approach does not depend on extra information provided by the ASR. We have evaluated our approach and compared it with state-of-the-art alternatives in both spoken document retrieval and spoken passage retrieval tasks. The evaluation was performed in the context of Question Answering using a corpus of automatic transcripts from the Spanish and European parliaments. The results show that our method outperforms traditional term-based search and n-gram search on automatic transcripts by 10 points.
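As a rough illustration of the "sounds like" search idea, the sketch below scores how well a query term aligns locally with a stretch of transcript text. It uses a plain Smith-Waterman local alignment over characters, which is a generic bioinformatics technique swapped in for illustration; the abstract does not specify the actual PHAST algorithm, its symbol alphabet (phones versus characters), or its scoring scheme, so the function name and the `match`, `mismatch`, and `gap` values here are assumptions.

```python
def local_alignment_score(query: str, text: str,
                          match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best Smith-Waterman local alignment score between two symbol sequences.

    Hypothetical helper: the scoring values are illustrative assumptions,
    not parameters taken from the paper.
    """
    prev = [0] * (len(text) + 1)  # scores for the previous query symbol
    best = 0
    for q in query:
        curr = [0]  # column 0: aligning against an empty text prefix
        for j, t in enumerate(text, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # Local alignment: a cell never drops below 0, so a match can
            # start anywhere in the text.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_passages(query: str, passages: list[str]) -> list[tuple[int, str]]:
    """Rank passages by their best local alignment score against the query."""
    scored = [(local_alignment_score(query, p), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With this scoring, a passage in which the recognizer split or distorted a word (for example a stray symbol inserted in the middle of the query term) still receives a high score, whereas exact term matching would miss it entirely. A real system would more plausibly align phone sequences than raw characters.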
Similar resources
Spoken Document Retrieval Based on Approximated Sequence Alignment
This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is the use of an automatic speech recognizer (ASR) combined with standard information retrieval techniques. However, ASRs tend to produce transcripts of spontaneous speech with significant word error rate, which is a drawback for standard retrieval t...
PhAST: Pharmacophore alignment search tool
We present a ligand-based virtual screening technique (PhAST) for rapid hit and lead structure searching in large compound databases. Molecules are represented as strings encoding the distribution of pharmacophoric features on the molecular graph. In contrast to other text-based methods using SMILES strings, we introduce a new form of text representation that describes the pharmacophore of mole...
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach to document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...
Text-based similarity searching for hit- and lead-candidate identification
The Pharmacophore Alignment Search Tool (PhAST) is a string-based approach to virtual screening. Molecules are represented by linear sequences which describe their respective pattern of interaction possibilities. The problem of molecule linearization is tackled by applying Minimum Volume Embedding in combination with a Diffusion Kernel to the molecular graph [1,2]. Linear representations are co...
Package 'rphast' Title R Interface to Phast Software for Comparative Genomics
December 13, 2013 Copyright The code in src/pcre is Copyright (c) 1997-2010 University of Cambridge. All other code is Copyright (c) 2002-2010 University of California, Cornell University. Maintainer Melissa Hubisz License BSD_3_clause + file LICENSE Title R interface to PHAST software for comparative genomics Author Melissa Hubisz, Katherine Pollard, and Adam Siepel Desc...
Publication date: 2008